Skip to content

Disentangle disparate 'bounds' ideas in processor #496

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Jun 22, 2022

Conversation

natecook1000
Copy link
Member

This starts the separation of two different ideas for boundaries in the base input:

  • subjectBounds: These represent the actual subject in the input string. For a String callee, this will cover the entire bounds, while for a Substring these will represent the bounds of the substring in the base.
  • searchBounds: These represent the current search range in the subject. These bounds can be the same as subjectBounds or a subrange when searching for subsequent matches or replacing only in a subrange of a string.

This starts the separation of two different ideas for boundaries in
the base input:

- subjectBounds: These represent the actual subject in the input
  string. For a `String` callee, this will cover the entire bounds,
  while for a `Substring` these will represent the bounds of the
  substring in the base.
- searchBounds: These represent the current search range in the
  subject. These bounds can be the same as `subjectBounds` or a
  subrange when searching for subsequent matches or replacing only
  in a subrange of a string.
We should be testing the higher-level Regex.firstMatch function
instead of the lower-level types directly.
str.replacing(_:with:subrange:) still needs to use separate bounds
for the subject and the search range.
When we move forward while searching for the first match, the search
bounds should stay the same. Only the currentPosition needs to move
forward. This will allow us to implement the \G start of match anchor,
with which /\Gab/ matches "abab" twice, compared with /^ab/, which
only matches once.
With this change, RegexMatchesCollection keeps the subject bounds
and search bounds separately, modifying the search bounds with each
iteration. In addition, the replace methods that only operate on a
subrange can specify that specifically, getting the correct anchor
behavior while only matching within a portion of a string.
@natecook1000
Copy link
Member Author

@swift-ci Please test

@natecook1000 natecook1000 changed the title [WIP] Disentangle disparate 'bounds' ideas in processor Disentangle disparate 'bounds' ideas in processor Jun 20, 2022
@natecook1000 natecook1000 marked this pull request as ready for review June 20, 2022 22:20
@natecook1000 natecook1000 requested a review from milseman June 20, 2022 22:20
@@ -20,44 +20,17 @@ struct MatchError: Error {
}
}

extension Executor {
func _firstMatch(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test method was unused.

Comment on lines 702 to 707
firstMatchTest(#"["a-c"]+"#, input: "abc", match: "a",
syntax: .experimental)
syntax: .experimental, xfail: true)
firstMatchTest(#"["abc"]+"#, input: "cba", match: "cba",
syntax: .experimental)
firstMatchTest(#"["abc"]+"#, input: #""abc""#, match: "abc",
syntax: .experimental)
syntax: .experimental, xfail: true)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what the right behavior is for these experimental ones…

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the same as the above, but we can also remove them. The quotes are just prettier \Q...\E

Comment on lines 55 to 33
var executor = try _compileRegex(regex, syntax)
executor.engine.enableTracing = enableTracing
let (str, caps) = try executor._firstMatch(
regex, input: input, enableTracing: enableTracing)
let capStrs = caps.map { $0 == nil ? nil : String($0!) }
return (String(str), capStrs)
let regex = try Regex(regexStr)
guard let result = try regex.firstMatch(in: input) else {
throw MatchError("match not found for \(regexStr) in \(input)")
}
let caps = result.output.slices(from: input)
return (String(input[result.range]), caps.map { $0.map(String.init) })
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched this to testing the actual public API instead of the executor.

@@ -411,9 +426,9 @@ extension Processor {

case .consumeBy:
let reg = payload.consumer
guard currentPosition < bounds.upperBound,
guard currentPosition < subjectBounds.upperBound,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subject or search bounds?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we, generally, know which one to choose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, we want to be moving and matching with searchBounds; subjectBounds should really only be used to match anchors, not for general purpose index movements.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(i.e. this was incorrect)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any way we could use access control or some better coding convention here? E.g. if we always want search bounds except for anchors, can we isolate the code that refers to subject bounds inside the engine?

Comment on lines 702 to 707
firstMatchTest(#"["a-c"]+"#, input: "abc", match: "a",
syntax: .experimental)
syntax: .experimental, xfail: true)
firstMatchTest(#"["abc"]+"#, input: "cba", match: "cba",
syntax: .experimental)
firstMatchTest(#"["abc"]+"#, input: #""abc""#, match: "abc",
syntax: .experimental)
syntax: .experimental, xfail: true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the same as the above, but we can also remove them. The quotes are just prettier \Q...\E

This still isn't quite right - Processor.consume() should really
return the matched range, since it already tracks the starting point
of matching. Having it return the whole range will also prepare us
for supporting `\K`, which will mean `currentPosition` isn't always
the start of the matched range.
This at least clarifies which boundaries are being used.
@natecook1000
Copy link
Member Author

@swift-ci Please test

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, some minor cleanups and testing requests.

These anchors should always match at start/end of line, not just
when .anchorsMatchLineEndings() is turned on.
@natecook1000
Copy link
Member Author

@swift-ci Please test

@natecook1000 natecook1000 merged commit e3e186f into swiftlang:main Jun 22, 2022
@natecook1000 natecook1000 deleted the substring-bounds branch June 22, 2022 21:47
milseman pushed a commit to milseman/swift-experimental-string-processing that referenced this pull request Jun 30, 2022
This separates the two different ideas for boundaries in
the base input:

- subjectBounds: These represent the actual subject in the input
  string. For a `String` callee, this will cover the entire bounds,
  while for a `Substring` these will represent the bounds of the
  substring in the base.
- searchBounds: These represent the current search range in the
  subject. These bounds can be the same as `subjectBounds` or a
  subrange when searching for subsequent matches or replacing only
  in a subrange of a string.

* firstMatch shouldn't update searchBounds on iteration

When we move forward while searching for the first match, the search
bounds should stay the same. Only the currentPosition needs to move
forward. This will allow us to implement the \G start of match anchor,
with which /\Gab/ matches "abab" twice, compared with /^ab/, which
only matches once.

* Make matches(of:) and ranges(of:) boundary-aware

With this change, RegexMatchesCollection keeps the subject bounds
and search bounds separately, modifying the search bounds with each
iteration. In addition, the replace methods that only operate on a
subrange can specify that specifically, getting the correct anchor
behavior while only matching within a portion of a string.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants